Introduction to Video Conferencing
Understand the difference between everyday client-server communication and video conferencing.
Introduction#
The Internet is designed to allow communication between clients and servers. Typically, the client initiates a request, and the server responds to it. Real-time applications in which users communicate with each other follow a similar flow. However, because each user's request is relayed through an intermediate server, some additional delay is inevitable. For video conferencing, this flow may not be optimal, especially when the server is far away from the user: the extra distance, combined with the size of the video data, increases delay and can result in glitchy playback.
We know that user-perceived latency mainly depends on the transfer time and processing time. Usually, live streams don’t require a lot of processing, and are forwarded from servers that can be miles away from users. Let’s discuss some available ways to solve this problem of extra miles.
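To get a rough feel for the cost of those extra miles, the sketch below estimates one-way propagation delay for a direct path versus a path relayed through a distant server. It assumes signals travel through optical fiber at about 200,000 km/s (roughly two-thirds the speed of light) and ignores queuing, processing, and routing detours. The distances and function names are made-up illustrative values, not measurements.

```python
# Rough propagation-delay estimate. All distances are illustrative assumptions.
FIBER_SPEED_KM_PER_S = 200_000  # approximate signal speed in optical fiber


def propagation_delay_ms(distance_km: float) -> float:
    """One-way propagation delay in milliseconds over the given distance."""
    return distance_km / FIBER_SPEED_KM_PER_S * 1000


def relayed_delay_ms(sender_to_server_km: float, server_to_receiver_km: float) -> float:
    """Delay when every packet travels sender -> server -> receiver."""
    return (propagation_delay_ms(sender_to_server_km)
            + propagation_delay_ms(server_to_receiver_km))


if __name__ == "__main__":
    direct = propagation_delay_ms(50)        # two users 50 km apart
    relayed = relayed_delay_ms(4000, 4000)   # both users 4,000 km from the server
    print(f"direct:  {direct:.2f} ms")
    print(f"relayed: {relayed:.2f} ms")
```

Under these assumptions the relayed path costs tens of milliseconds before any processing happens, while the direct path stays well under a millisecond.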
Real-time communication#
Real-time communication requires the shortest path to transmit data, which is possible through peer-to-peer communication. Still, it becomes problematic when participants are behind different Network Address Translation (NAT) devices. To overcome this problem, a signaling server is used that allows the communicating parties to share their multimedia session descriptions, which include the IP addresses, ports, and other information essential for the communication. Let's see how signaling helps in sharing this multimedia session information.
Note: A multimedia session is used to identify media-related metadata essential for media transmission, processing, etc. It is also helpful for identifying and enabling device-specific features compatible with other participants.
Points to Ponder
Question
Why is peer-to-peer connection not feasible for direct communication between clients originating from different networks?
Peer-to-peer technology may work within the same local area network (LAN). But for users (specifically those using IPv4) who are behind a NAT, which replaces the private address of the communicating device with the public IP of the LAN gateway, or behind other edge devices that monitor and control traffic flow, peer-to-peer connections are more challenging to establish and maintain. For example, private networks are usually protected by firewalls that block incoming requests for better security, which further restricts communication to the client-server model.
Note: There are some workarounds, such as the Skype protocol (not publicly available), WebRTC, and so on, which allow peer-to-peer communication over the Internet.
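The NAT behavior described in the answer can be illustrated with a toy model. The sketch below is not a real networking implementation: the `ToyNat` class, its method names, and the addresses are all hypothetical. It only shows that a mapping is created when a device inside the LAN sends traffic out, and that an unsolicited packet from outside has no mapping to follow. Real NATs differ in how strictly they also filter by remote address.

```python
from typing import Dict, Optional, Tuple


class ToyNat:
    """Toy NAT gateway: maps (private ip, port) pairs to public ports."""

    def __init__(self, public_ip: str):
        self.public_ip = public_ip
        self.next_port = 40000
        # public port -> (private ip, private port)
        self.mappings: Dict[int, Tuple[str, int]] = {}

    def outbound(self, private_ip: str, private_port: int) -> Tuple[str, int]:
        """A LAN device sends a packet out; reuse or allocate a public port."""
        for public_port, inside in self.mappings.items():
            if inside == (private_ip, private_port):
                return (self.public_ip, public_port)
        public_port = self.next_port
        self.next_port += 1
        self.mappings[public_port] = (private_ip, private_port)
        return (self.public_ip, public_port)

    def inbound(self, public_port: int) -> Optional[Tuple[str, int]]:
        """A packet arrives from outside; None means it is dropped."""
        return self.mappings.get(public_port)


if __name__ == "__main__":
    nat = ToyNat("203.0.113.7")
    print(nat.inbound(40000))                     # unsolicited: no mapping yet
    print(nat.outbound("192.168.1.10", 5000))     # device opens a mapping
    print(nat.inbound(40000))                     # replies can now be delivered
```

This is why two peers that are both behind NATs cannot simply dial each other: neither knows the other's public address and port until some third party relays that information.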
Signaling and connecting#
Signaling refers to the successful initiation of a multimedia session between participants willing to participate in the audio/video conference. However, before the communication starts, clients must exchange and agree on multimedia session information, such as communication addresses (IP and port), media descriptions (text, audio, video, etc.), and other metadata. This information is usually sent via the session description protocol (SDP).
The session description protocol (SDP) is a format for describing session information in a standardized form. Because it is only a description format, it must be delivered using protocols like Session Initiation Protocol (SIP) or Session Announcement Protocol (SAP), which are specially designed to share session information between participants. Let's take the session initiation protocol (SIP) as an example, due to its versatility, and go over how SDPs are exchanged between different participants.
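To make the idea of a session description concrete, here is a minimal sketch of what an SDP body can look like and how a client might pull the address and ports out of it. The line types (`v=`, `o=`, `s=`, `c=`, `t=`, `m=`, `a=`) follow the SDP specification, but the addresses, ports, and codecs are made-up example values, and `parse_sdp` is a hypothetical helper that handles only a session-level `c=` line.

```python
# A minimal, illustrative SDP body. All values are made-up examples.
EXAMPLE_SDP = """\
v=0
o=alice 2890844526 2890844526 IN IP4 192.0.2.10
s=Example call
c=IN IP4 192.0.2.10
t=0 0
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
m=video 51372 RTP/AVP 96
a=rtpmap:96 VP8/90000
"""


def parse_sdp(text: str) -> dict:
    """Extract the connection address and the media lines from an SDP body."""
    session = {"address": None, "media": []}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        if key == "c":
            # c=<nettype> <addrtype> <connection-address>
            session["address"] = value.split()[2]
        elif key == "m":
            # m=<media> <port> <proto> <fmt> ...
            media, port, proto, *formats = value.split()
            session["media"].append(
                {"type": media, "port": int(port), "proto": proto, "formats": formats}
            )
    return session


if __name__ == "__main__":
    print(parse_sdp(EXAMPLE_SDP))
```

Each participant sends a description like this to the other side, which is how both ends learn where to send audio and video and which formats are acceptable.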
Session initiation protocol (SIP)#
Session initiation protocol (SIP) is a set of rules for initiating, maintaining, and terminating audio/video conferences between peers. It is not a complete vertical stack, and it usually works with other protocols, such as RTP, RTSP, and HTTP, to provide a comprehensive service. It uses a network of proxy servers, which help discover participants. Participants can negotiate their sessions using the SDP to establish a connection. It is also helpful for adding participants to, and removing them from, an existing session, for example, a multicast conference session.
The following are the SIP methods commonly used to send requests to the SIP server when initiating a session:
- REGISTER: This method registers the contact information of users and creates a map of the public URI to the contact information.
- INVITE: This method initiates a session that eventually reaches the registered user, who can accept or reject the invitation.
- ACK: This method confirms that the client has received the final response to its INVITE request. That response carries a status code similar to the status code in an HTTP response.
- CANCEL: This method is used to cancel an initiated request, and the server generates an error response for that request.
- BYE: This method terminates the current session.
- OPTIONS: This method is used for querying information from the SIP servers.
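The way these methods fit together can be sketched with a toy, in-memory model. This is a simplification under stated assumptions: real SIP exchanges text messages over the network, and the `ToySipServer` class, its method names, and the `accept` step (which stands in for the callee answering) are hypothetical. The numeric status codes (180, 200, 404, 481, 487) are standard SIP response codes.

```python
class ToySipServer:
    """Toy, in-memory model of a SIP registrar and session tracker."""

    def __init__(self):
        self.registry = {}   # public URI -> contact address
        self.sessions = {}   # (caller, callee) -> state

    def register(self, uri, contact):
        """REGISTER: map a public URI to the user's current contact address."""
        self.registry[uri] = contact
        return 200  # OK

    def invite(self, caller, callee):
        """INVITE: start a session if the callee is registered."""
        if callee not in self.registry:
            return 404  # Not Found
        self.sessions[(caller, callee)] = "ringing"
        return 180  # Ringing (provisional response)

    def accept(self, caller, callee):
        """Hypothetical helper: the callee answers the invitation."""
        if self.sessions.get((caller, callee)) != "ringing":
            return 481  # Call/Transaction Does Not Exist
        self.sessions[(caller, callee)] = "answered"
        return 200  # OK (final response to the INVITE)

    def ack(self, caller, callee):
        """ACK: the caller confirms it received the final response."""
        if self.sessions.get((caller, callee)) == "answered":
            self.sessions[(caller, callee)] = "established"

    def cancel(self, caller, callee):
        """CANCEL: abandon an INVITE that has not been answered yet."""
        if self.sessions.get((caller, callee)) != "ringing":
            return 481
        del self.sessions[(caller, callee)]
        return 487  # Request Terminated

    def bye(self, caller, callee):
        """BYE: terminate an established session."""
        if self.sessions.get((caller, callee)) != "established":
            return 481
        del self.sessions[(caller, callee)]
        return 200


if __name__ == "__main__":
    server = ToySipServer()
    alice, bob = "sip:alice@example.com", "sip:bob@example.com"
    server.register(bob, "192.0.2.20:5060")
    print(server.invite(alice, bob), server.accept(alice, bob))
    server.ack(alice, bob)
    print(server.sessions)
    print(server.bye(alice, bob))
```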
The following illustration shows two clients exchanging their session information using signaling (in this case, SIP). Both clients agree to share data (steps 1 and 2), but to exchange actual data (video frames in our case), we need to establish an interactive connection, as described in step 3 in the illustration below:
Now let’s discuss the different protocols used for exchanging data between the communicating parties.
Data exchange protocols#
There are several protocols that provide bidirectional data flow directly between two devices (endpoints), namely WebSockets, WebRTC, H.323, and so on. Among these, WebRTC is one of the most popular peer-to-peer video conferencing protocols. It's a protocol stack that works on top of other protocols (SRTP, SCTP, DTLS, etc.). It was originally developed to enable multimedia conferencing directly from the browser. Later, some tools and technologies were added to its stack for native application support on different platforms.
Peer-to-peer communication protocols (like WebRTC) introduce powerful capabilities for applications requiring large data transfer in real-time, but everything has its cost. Let's discuss the limitations in the next section.
Scaling real-time communication#
While we have established that SIP facilitates the exchange of session information and WebRTC handles the exchange of data, these are still not enough to achieve a scalable solution when it comes to audio/video conferencing. Peer-to-peer connections are great for small groups, say five to ten participants, but when the number increases, a mesh of peer-to-peer connections is created, which is resource intensive for each participant, as shown in the illustration below:
Point to Ponder
Question
Why is the total number of connections in a mesh topology $n(n-1)/2$ and not $n(n-1)$?
We must take care when counting the total number of connections in a network mesh topology, because we can count a connection multiple times when performing a summation of connections per client. For example, when counting the connections associated with Client 1 and Client 2 in the figure above, we might count the direct connection between Client 1 and Client 2 twice, resulting in a total of $n(n-1)$ connections on the network, where $n$ is the number of clients. To avoid this duplication, we can simply divide the result by two to get the actual number.
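The double-counting argument can be checked with a few lines of code. The sketch below assumes a full mesh of n participants, each connected directly to the other n - 1, and compares the divide-by-two formula with a brute-force count of unordered pairs. The function names are illustrative.

```python
from itertools import combinations


def mesh_connections(n: int) -> int:
    """Distinct peer-to-peer connections among n participants."""
    per_client_sum = n * (n - 1)   # each client counts its n - 1 links
    return per_client_sum // 2     # every link was counted from both ends


def brute_force_connections(n: int) -> int:
    """Count every unordered pair of participants exactly once."""
    return sum(1 for _ in combinations(range(n), 2))


if __name__ == "__main__":
    for n in (2, 5, 10, 50):
        print(n, mesh_connections(n), brute_force_connections(n))
```

Because the count grows quadratically, going from 10 to 50 participants raises the number of connections from 45 to 1,225.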
As shown above, the peer-to-peer paradigm is not suitable for scaling systems. Therefore, for larger groups, there is another approach called Multipoint Control Unit (MCU), which receives incoming streams from each client, merges them into one stream according to defined settings, and sends one stream back to each participant. The working of an MCU server is given in the illustration below:
The approach using the MCU server is not optimal because we need to process the incoming streams before compiling them into one outgoing stream, which can cause considerable lag during a multicast stream with many participants.
Let's now discuss another approach called Selective Forwarding Unit (SFU), which receives incoming streams from each client and selectively forwards them to other participants. For example, if there are
Note: Simulcast SFU (SSFU) is an extension of SFU where each participant also sends their streams in different resolutions that they can support. The simulcast SFU then adaptively forwards the stream to each client based on available bandwidth and the maximum resolution they can handle.
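The adaptive forwarding described in the note can be sketched as a simple selection rule. Everything here is a hypothetical illustration: the layer names, resolutions, and bitrates are assumed values, and a real SFU would also react to packet loss and changing network estimates over time.

```python
# Hypothetical simulcast layers: (label, height in pixels, bitrate in kbps).
LAYERS = [("low", 180, 150), ("medium", 360, 500), ("high", 720, 1500)]


def pick_layer(available_kbps, max_height, layers=LAYERS):
    """Return the best layer that fits the receiver's bandwidth and resolution.

    Falls back to the lowest-bitrate layer when nothing fits.
    """
    fitting = [
        layer for layer in layers
        if layer[2] <= available_kbps and layer[1] <= max_height
    ]
    if not fitting:
        return min(layers, key=lambda layer: layer[2])
    return max(fitting, key=lambda layer: layer[2])


if __name__ == "__main__":
    receivers = {"laptop": (2000, 1080), "phone": (600, 1080), "watch": (2000, 360)}
    for name, (kbps, height) in receivers.items():
        print(name, pick_layer(kbps, height)[0])
```

The sender uploads all three layers once, and the server makes this choice separately for every receiver.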
From the above discussion, we can conclude that peer-to-peer connection is the shortest route for the data flow, but it creates
Zoom uses differentiated services code point (DSCP) markings to prioritize its traffic at the network layer and maintain quality of service. Refer to RFC 2474 to learn how DSCP works.
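As a rough illustration of how an application can request DSCP marking, the sketch below sets the marking on a UDP socket with the standard `IP_TOS` socket option. The DSCP value occupies the upper six bits of the byte that option sets, hence the shift by two. EF (46) and AF41 (34) are standard code points often associated with voice and interactive video. This sketch does not claim they are the values Zoom uses, and operating systems and networks may ignore or rewrite the marking.

```python
import socket

DSCP_EF = 46    # Expedited Forwarding
DSCP_AF41 = 34  # Assured Forwarding, class 4, low drop precedence


def dscp_to_tos(dscp: int) -> int:
    """Place a six-bit DSCP value in the upper bits of the DS/TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in six bits (0-63)")
    return dscp << 2


def mark_socket(sock: socket.socket, dscp: int) -> None:
    """Ask the OS to mark outgoing IPv4 packets on this socket."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        mark_socket(sock, DSCP_EF)
        print("requested TOS byte:", dscp_to_tos(DSCP_EF))
```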
Geographically distributed media server#
Companies such as Zoom, Google, and Microsoft have joint ventures with other companies and geographically distribute their media servers (MCU, SFU, Simulcast, and so on) to improve their services' performance.
The majority of users collaborate in their geographical area, and distributing media servers in their local area helps us to achieve reliable video transfer with low latency.
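One simple way to picture this is to route each user to the geographically closest media server. The sketch below does that with great-circle (haversine) distance. The server names and coordinates are hypothetical, and production systems would more likely choose based on measured latency and load rather than distance alone.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical media server locations: name -> (latitude, longitude).
SERVERS = {
    "us-east": (39.0, -77.5),
    "eu-west": (53.3, -6.3),
    "ap-south": (19.1, 72.9),
}

EARTH_RADIUS_KM = 6371


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def nearest_server(user_location, servers=SERVERS):
    """Return the name of the server closest to the user."""
    return min(servers, key=lambda name: haversine_km(user_location, servers[name]))


if __name__ == "__main__":
    print(nearest_server((48.9, 2.4)))    # a user near Paris
    print(nearest_server((28.6, 77.2)))   # a user near Delhi
```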
Summary#
The following table summarizes the different techniques and protocols discussed in this lesson:
Name | Type | Description |
SDP | Protocol | A standard format for describing multimedia session information |
SIP | Protocol | A set of rules to share, maintain, and terminate multimedia sessions |
WebRTC | Protocol | A simple and easy peer-to-peer data exchange protocol |
H.323 | Protocol | A complex but fine-tuned peer-to-peer communication protocol |
MCU | Relay | Takes an input stream and sends an output stream. Requires computational power to merge incoming streams on the go. |
SFU | Relay | Takes an input stream and sends multiple output streams. Requires high bandwidth to send multiple high-resolution data streams. |
SSFU | Relay | Takes multiple input streams and sends multiple output streams. Can adapt to network conditions. |
In this lesson, we discussed some techniques used for real-time multicasting. In the next lesson, we’ll make some key decisions that will help us to develop an efficient API design for a real-time video conferencing application like Zoom.